Project Home

Three datasets in CSV format were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) Study Data repository. ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

About the Tau-PET Data

Longitudinal 18F-AV-1451 tau-PET data was downloaded from Study Data/Imaging/PET Image Analysis/UC Berkeley - AV1451 Analysis [ADNI2,3] (version: 5/12/2020). This CSV file contains 1,121 rows and 241 columns.

Each row represents one tau-PET scan; some subjects had repeated scans separated by approximately one year, while other subjects had only one scan. Columns include subject information including anonymized subject ID, visit code, and PET exam date. The other columns encode regional volume and tau-PET uptake. Specifically, there are 123 distinct cortical and subcortical regions of interest (ROIs), each of which has a volume field (in mm^3) and a tau-PET uptake field, called the Standardized Uptake Value Ratio (SUVR).

The SUVR value is normalized to the tau-PET uptake in the inferior cerebellum gray matter (highlighted in orange above), a commonly-used region for tau normalization given the lack of inferior cerebellar tau pathology in Alzheimer’s Disease.

create figure about tauPET/freesurfer/rois

These 123 ROIs were delineated by first co-registering the tau-PET image to a high-resolution structural T1-weighted MPRAGE acquired in the same imaging session, and then applying FreeSurfer (v5.3) for automated regional segmentation and parcellation. Furthermore, to mitigate issues with lower voxel resolution in PET imaging, partial volume correction was applied to use probabilistic tissue segmentation maps to refine individual ROIs. Note: these PET processing steps were all performed by Susan Landau, Deniz Korman, and William Jagust at the Helen Wills Neuroscience Institute, UC Berkeley and Lawrence Berkeley National Laboratory.

About the cognitive data

Longitudinal cognitive data was downloaded from “Study Data/Study Info/Key ADNI tables merged into one table”. This CSV file contains 14,816 rows and 113 columns. This includes many subjects with multiple follow-up visits. Columns include (anonymized) subject demographic information such as age, sex, race, and marriage status.

This project will explore two target features in this dataset:

The CDR scale was initially developed in 1982 at the Washington University as a metric of clinical dementia progression (Hughes et al., 1982). This cognitive test measures impairment in six cognitive domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (Morris 1991). Each of these is scored on a five-point scale as follows:

The global score is most heavily influenced by the memory score, though there is an established decision tree-like algorithm for how to calculate the global score. By contrast, the CDR Sum of Boxes reflects the sum of scores for each of the six domains, with an overall range of 0 (no impairment) to 18 (severe impairment). The CDR Sum of Boxes is an extension upon the CDR global score, offering a more detailed quantitative score. This metric reportedly improves precision in longitudinal cognitive tracking, particularly in cases of mild dementia (O’Bryant et al., 2008) Of note, the CDR assessment shows high inter-rater reliability, which is important given the inter-site nature of the ADNI cohort (Morris 1991).

Sources

Hughes, C. P., Berg, L., Danziger, W., Coben, L. A., & Martin, R. L. (1982). A new clinical scale for the staging of dementia. The British journal of psychiatry, 140(6), 566-572.

Morris, J. C. (1991). The clinical dementia rating (cdr): Current version and. Young, 41, 1588-1592.

O’Bryant, S. E., Waring, S. C., Cullum, C. M., Hall, J., Lacritz, L., Massman, P. J., … & Doody, R. (2008). Staging dementia using Clinical Dementia Rating Scale Sum of Boxes scores: a Texas Alzheimer’s research consortium study. Archives of neurology, 65(8), 1091-1095.

   

Tau-PET Data Exploration

Packages used to create the below figures:

library(tidyverse)
library(knitr)
library(kableExtra)
library(DMwR)
library(plotly)
library(lubridate)
library(forcats)
library(readxl)
library(ggsignif)
library(DT)
library(ggpointdensity)
library(rstatix)

Load partial volume corrected regional tau-PET data, as downloaded from ADNI:

tau.df <- read.csv("../../ADNI_Data/Raw_Data/UCBERKELEYAV1451_PVC_05_12_20.csv")
str(tau.df)
## 'data.frame':    1120 obs. of  165 variables:
##  $ RID                                   : int  21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE                               : chr  "init" "init" "y1" "init" ...
##  $ VISCODE2                              : chr  "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                              : chr  "2/2/2018" "4/24/2018" "4/23/2019" "2/20/2018" ...
##  $ INFERIOR_CEREBGM_SUVR                 : num  1.32 1.33 1.33 1.28 1.24 ...
##  $ INFERIOR_CEREBGM_VOLUME               : int  52306 54296 54296 56750 56750 56750 59836 56862 56862 56862 ...
##  $ HEMIWM_SUVR                           : num  1.02 0.85 0.866 1.138 1.196 ...
##  $ HEMIWM_VOLUME                         : int  321220 281690 281690 336495 336495 336495 294422 463900 463900 463900 ...
##  $ BRAAK12_SUVR                          : num  2.06 2.24 2.3 1.91 1.88 ...
##  $ BRAAK12_VOLUME                        : int  10275 7587 7587 9376 9376 9376 10379 10981 10981 10981 ...
##  $ BRAAK34_SUVR                          : num  1.95 1.87 1.8 1.82 1.77 ...
##  $ BRAAK34_VOLUME                        : int  95661 95419 95419 92482 92482 92482 94092 112788 112788 112788 ...
##  $ BRAAK56_SUVR                          : num  1.99 1.92 1.84 1.87 1.84 ...
##  $ BRAAK56_VOLUME                        : int  284821 288136 288136 283119 283119 283119 283727 325054 325054 325054 ...
##  $ BRAIN_STEM_SUVR                       : num  1.27 1.12 1.12 1.2 1.17 ...
##  $ BRAIN_STEM_VOLUME                     : int  16955 16952 16952 20508 20508 20492 18057 18872 18872 18866 ...
##  $ LEFT_MIDDLEFR_SUVR                    : num  2.02 1.93 1.8 1.83 1.78 ...
##  $ LEFT_MIDDLEFR_VOLUME                  : int  17640 18517 18517 17164 17164 17164 17683 21907 21907 21907 ...
##  $ LEFT_ORBITOFR_SUVR                    : num  2.17 2.03 1.92 2.11 1.98 ...
##  $ LEFT_ORBITOFR_VOLUME                  : int  11676 10091 10091 11721 11721 11721 10917 12109 12109 12109 ...
##  $ LEFT_PARSFR_SUVR                      : num  2.02 2.01 1.98 2.03 1.99 ...
##  $ LEFT_PARSFR_VOLUME                    : int  9201 7799 7799 9185 9185 9185 7709 9813 9813 9813 ...
##  $ LEFT_ACCUMBENS_AREA_SUVR              : num  1.14 1.04 1.79 1.12 1.18 ...
##  $ LEFT_ACCUMBENS_AREA_VOLUME            : int  500 318 318 308 308 308 353 361 361 361 ...
##  $ LEFT_AMYGDALA_SUVR                    : num  1.31 1.54 1.63 1.42 1.37 ...
##  $ LEFT_AMYGDALA_VOLUME                  : int  1367 1224 1224 1561 1561 1561 993 1499 1499 1499 ...
##  $ LEFT_CAUDATE_SUVR                     : num  2.08 1.46 1.34 1.95 1.83 ...
##  $ LEFT_CAUDATE_VOLUME                   : int  3016 4890 4890 3083 3083 3083 2874 4049 4049 4049 ...
##  $ LEFT_HIPPOCAMPUS_SUVR                 : num  2.12 1.96 2.2 1.69 1.73 ...
##  $ LEFT_HIPPOCAMPUS_VOLUME               : int  3822 3050 3050 3476 3476 3476 3603 3550 3550 3550 ...
##  $ LEFT_PALLIDUM_SUVR                    : num  3.79 1.89 1.95 2.5 2.6 ...
##  $ LEFT_PALLIDUM_VOLUME                  : int  444 2066 2066 1301 1301 1301 1081 1634 1634 1634 ...
##  $ LEFT_PUTAMEN_SUVR                     : num  1.69 1.64 1.42 1.9 1.78 ...
##  $ LEFT_PUTAMEN_VOLUME                   : int  4000 5675 5675 4832 4832 4832 3563 4891 4891 4891 ...
##  $ LEFT_THALAMUS_PROPER_SUVR             : num  1.45 1.32 1.24 1.54 1.53 ...
##  $ LEFT_THALAMUS_PROPER_VOLUME           : int  8226 6195 6195 7114 7114 7114 7561 7518 7518 7518 ...
##  $ RIGHT_MIDDLEFR_SUVR                   : num  2.08 1.91 1.8 1.94 1.85 ...
##  $ RIGHT_MIDDLEFR_VOLUME                 : int  17250 18440 18440 15605 15605 15605 16280 22586 22586 22586 ...
##  $ RIGHT_ORBITOFR_SUVR                   : num  2.19 2.01 1.86 2.17 2.03 ...
##  $ RIGHT_ORBITOFR_VOLUME                 : int  11614 12637 12637 11064 11064 11064 11537 12575 12575 12575 ...
##  $ RIGHT_PARSFR_SUVR                     : num  2.17 2.08 1.9 2.09 2.01 ...
##  $ RIGHT_PARSFR_VOLUME                   : int  9255 8131 8131 9641 9641 9641 8839 9119 9119 9119 ...
##  $ RIGHT_ACCUMBENS_AREA_SUVR             : num  1.41 1.65 1.66 1.01 1.07 ...
##  $ RIGHT_ACCUMBENS_AREA_VOLUME           : int  545 413 413 423 423 423 542 528 528 528 ...
##  $ RIGHT_AMYGDALA_SUVR                   : num  1.18 1.79 1.89 1.37 1.44 ...
##  $ RIGHT_AMYGDALA_VOLUME                 : int  1268 1028 1028 1464 1464 1464 1313 1797 1797 1797 ...
##  $ RIGHT_CAUDATE_SUVR                    : num  2.01 1.57 1.37 1.96 1.89 ...
##  $ RIGHT_CAUDATE_VOLUME                  : int  3179 4854 4854 2984 2984 2984 3224 3835 3835 3835 ...
##  $ RIGHT_HIPPOCAMPUS_SUVR                : num  2.01 2.09 2.03 1.62 1.64 ...
##  $ RIGHT_HIPPOCAMPUS_VOLUME              : int  3978 2723 2723 3489 3489 3489 3667 3942 3942 3942 ...
##  $ RIGHT_PALLIDUM_SUVR                   : num  3.01 2.32 2.12 2.33 2.48 ...
##  $ RIGHT_PALLIDUM_VOLUME                 : int  846 1531 1531 1262 1262 1262 1088 1552 1552 1552 ...
##  $ RIGHT_PUTAMEN_SUVR                    : num  1.68 1.62 1.53 2.06 1.94 ...
##  $ RIGHT_PUTAMEN_VOLUME                  : int  4322 5774 5774 4328 4328 4328 3190 4569 4569 4569 ...
##  $ RIGHT_THALAMUS_PROPER_SUVR            : num  1.42 1.33 1.24 1.52 1.55 ...
##  $ RIGHT_THALAMUS_PROPER_VOLUME          : int  5968 5442 5442 5940 5940 5940 6257 7899 7899 7899 ...
##  $ CHOROID_SUVR                          : num  7.45 4.56 4.31 3.84 3.79 ...
##  $ CHOROID_VOLUME                        : int  4180 3591 3591 3165 3165 3165 3717 3663 3663 3663 ...
##  $ CTX_LH_BANKSSTS_SUVR                  : num  1.75 1.49 1.6 1.7 1.63 ...
##  $ CTX_LH_BANKSSTS_VOLUME                : int  1553 1633 1633 1812 1812 1812 1694 2601 2601 2601 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_SUVR   : num  1.67 1.73 1.65 1.69 1.69 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_VOLUME : int  1138 1387 1387 1124 1124 1124 1465 1512 1512 1512 ...
##  $ CTX_LH_CUNEUS_SUVR                    : num  2.33 2.2 2.05 2.01 2 ...
##  $ CTX_LH_CUNEUS_VOLUME                  : int  2023 2702 2702 2429 2429 2429 2393 2222 2222 2222 ...
##  $ CTX_LH_ENTORHINAL_SUVR                : num  2.07 2.3 2.43 2.79 2.52 ...
##  $ CTX_LH_ENTORHINAL_VOLUME              : int  1468 1035 1035 1068 1068 1068 1297 1888 1888 1888 ...
##  $ CTX_LH_FUSIFORM_SUVR                  : num  1.97 1.87 1.83 1.84 1.77 ...
##  $ CTX_LH_FUSIFORM_VOLUME                : int  7956 6997 6997 7694 7694 7694 7807 9083 9083 9083 ...
##  $ CTX_LH_INFERIORPARIETAL_SUVR          : num  1.99 1.95 1.94 1.85 1.89 ...
##  $ CTX_LH_INFERIORPARIETAL_VOLUME        : int  11656 10174 10174 9243 9243 9243 8180 9846 9846 9846 ...
##  $ CTX_LH_INFERIORTEMPORAL_SUVR          : num  2.16 1.97 2.05 2.1 2.02 ...
##  $ CTX_LH_INFERIORTEMPORAL_VOLUME        : int  6606 6418 6418 7286 7286 7286 6869 9599 9599 9599 ...
##  $ CTX_LH_INSULA_SUVR                    : num  1.51 1.64 1.65 1.51 1.48 ...
##  $ CTX_LH_INSULA_VOLUME                  : int  6711 4654 4654 6003 6003 6003 5513 6597 6597 6597 ...
##  $ CTX_LH_ISTHMUSCINGULATE_SUVR          : num  1.9 1.81 1.82 1.79 1.94 ...
##  $ CTX_LH_ISTHMUSCINGULATE_VOLUME        : int  2283 2215 2215 1549 1549 1549 1944 2264 2264 2264 ...
##  $ CTX_LH_LATERALOCCIPITAL_SUVR          : num  2.39 2.06 1.99 1.92 2 ...
##  $ CTX_LH_LATERALOCCIPITAL_VOLUME        : int  8532 10148 10148 8292 8292 8292 10612 9404 9404 9404 ...
##  $ CTX_LH_LINGUAL_SUVR                   : num  2.27 1.95 1.97 1.76 1.74 ...
##  $ CTX_LH_LINGUAL_VOLUME                 : int  4329 4658 4658 5606 5606 5606 5435 6488 6488 6488 ...
##  $ CTX_LH_MIDDLETEMPORAL_SUVR            : num  2.2 2.06 1.89 2.04 1.99 ...
##  $ CTX_LH_MIDDLETEMPORAL_VOLUME          : int  7445 8322 8322 7292 7292 7292 8031 9467 9467 9467 ...
##  $ CTX_LH_PARACENTRAL_SUVR               : num  1.99 1.79 1.8 1.91 1.8 ...
##  $ CTX_LH_PARACENTRAL_VOLUME             : int  2672 2890 2890 3231 3231 3231 3358 3173 3173 3173 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_SUVR           : num  1.6 1.86 1.92 1.72 1.66 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_VOLUME         : int  1659 1549 1549 1900 1900 1900 1989 2296 2296 2296 ...
##  $ CTX_LH_PERICALCARINE_SUVR             : num  2.23 1.45 1.41 1.56 1.54 ...
##  $ CTX_LH_PERICALCARINE_VOLUME           : int  1678 2004 2004 1866 1866 1866 1918 1927 1927 1927 ...
##  $ CTX_LH_POSTCENTRAL_SUVR               : num  2.03 1.81 1.82 1.85 1.78 ...
##  $ CTX_LH_POSTCENTRAL_VOLUME             : int  8281 8428 8428 8275 8275 8275 7580 8976 8976 8976 ...
##  $ CTX_LH_POSTERIORCINGULATE_SUVR        : num  1.82 1.89 1.84 1.72 1.67 ...
##  $ CTX_LH_POSTERIORCINGULATE_VOLUME      : int  2439 2608 2608 2683 2683 2683 2573 2638 2638 2638 ...
##  $ CTX_LH_PRECENTRAL_SUVR                : num  1.91 1.85 1.75 1.62 1.61 ...
##  $ CTX_LH_PRECENTRAL_VOLUME              : int  11174 12349 12349 10924 10924 10924 10820 12307 12307 12307 ...
##  $ CTX_LH_PRECUNEUS_SUVR                 : num  1.93 1.89 1.94 1.81 1.81 ...
##  $ CTX_LH_PRECUNEUS_VOLUME               : int  7870 8313 8313 8387 8387 8387 8311 8584 8584 8584 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_SUVR  : num  1.71 1.58 1.49 1.59 1.48 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_VOLUME: int  2928 2448 2448 1695 1695 1695 2466 2915 2915 2915 ...
##  $ CTX_LH_SUPERIORFRONTAL_SUVR           : num  1.86 1.86 1.74 1.84 1.77 ...
##   [list output truncated]

18F-AV-1451 is a relatively recent PET tracer, and was only incorporated into the ADNI-3 pipeline beginning in 2016. I am curious about the temporal distribution of the FreeSurfer-analyzed scans here:

tau.df %>%
  select(RID, EXAMDATE) %>%
  mutate(Scan_Date = as.Date(EXAMDATE, format="%m/%d/%Y")) %>%
  plot_ly(x=~Scan_Date, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Tau-PET Scan Date Distribution',
         xaxis = list(title = 'Scan Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of PET Scans')) 

Even though ADNI3 officially began in 2016, most scans were acquired from mid-2017 to present day. This doesn’t affect this analysis since the same protocol has been observed.

Since this project will explore tau-PET measurements over time, I will be refining the dataset to only subjects with multiple tau-PET scans. Here’s the overall distribution of number of longitudinal scans by subject:

p_num_long <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_scans=n()) %>%
  ggplot(., aes(x=fct_reorder(RID, n_scans, .desc=T), y=n_scans)) +
  geom_bar(stat="identity", aes(fill=n_scans, color=n_scans)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of Longitudinal PET Scans per Subject") +
  ylab("Number of PET Scans") +
  xlab("Subject") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

ggplotly(p_num_long)
rm(p_num_long)

The majority of subjects only had one tau-PET scan; given the longitudinal nature of this project, such subjects will eventually be omitted from analysis. On that note, it’s important to know exactly how many subjects do have at least two tau-PET scans:

num_scans <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_scans=sum(n_scans))
cat("Number of subjects with at least two scans: **", 
    num_scans$num_subjects, "**\n", 
    "\nNumber of total PET scans: **", 
    num_scans$total_scans, "**\n", sep="")

Number of subjects with at least two scans: 249

Number of total PET scans: 593

So, we have 249 subjects with two or more scans. Another important consideration is the length of time between each consecutive scan. I will eventually normalize changes in tau-PET to number of years passed to yield an annual rate of change, but it’s good to know what that time interval is:

p_pet_interval <- tau.df %>%
  select(RID, EXAMDATE) %>%
  mutate(Scan_Date = as.Date(EXAMDATE, format="%m/%d/%Y")) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  mutate(Years_between_Scans = 
           as.numeric((Scan_Date - lag(Scan_Date, 
                                       default = Scan_Date[1]))/365)) %>%
  filter(Years_between_Scans>0) %>%
  ggplot(., aes(x=Years_between_Scans)) +
  geom_histogram(stat="count", color="lightslategray") +
  ggtitle("Years in between Tau-PET Scans per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive scans for a subject") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 
ggplotly(p_pet_interval)
rm(p_pet_interval)

There’s a cluster of scans around the one-year mark. Presumably, ADNI3 participants are enrolled in an annual tau-PET plan, though in some cases scans aren’t at precise yearly intervals.

I’ll check if there are any missing data points for tau-PET SUVR values for any of the FreeSurfer-derived cortical or subcortical regions of interest (ROIs). Note: this is filtered to show only subjects with at least two scans:

tau.df %>%
  select(-VISCODE, -VISCODE2, -update_stamp) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  ungroup() %>%
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  group_by(ROI) %>%
  summarise(`Percent Missing` = sum(is.na(SUVR))/n(),
            `Number Missing` = sum(is.na(SUVR))) %>%
  datatable()

All regions have zero missing data points, so no imputation will be necessary. Now, I’ll check the distribution of tau-PET uptake values across the ROIs.

p_roi_suvr <- tau.df %>%
  select(-VISCODE, -VISCODE2, -update_stamp) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  mutate(ROI=tolower(ROI)) %>%
  group_by(ROI) %>%
  summarise(Mean_SUVR=mean(SUVR, na.rm=T),
            SD_SUVR = sd(SUVR, na.rm=T),
            ymin = Mean_SUVR-SD_SUVR,
            ymax = Mean_SUVR+SD_SUVR) %>%
  ggplot(data=., mapping=aes(x=fct_reorder(ROI, Mean_SUVR, .desc=F), 
                             y=Mean_SUVR,
                             label = ROI)) +
  geom_bar(stat="identity", show.legend=F, fill="lightsteelblue") +
  geom_errorbar(aes(ymin=ymin, ymax=ymax), width=0, color="lightslategray") +
  coord_flip() +
  theme_minimal() +
  ylab("Mean Tau-PET SUVR") +
  xlab("Region of Interest") +
  ggtitle("Mean Tau-PET SUVR by ROI") +
  theme(axis.text.x=element_text(size=8, angle=45))


ggplotly(p_roi_suvr, height=1000, width=600, tooltip=c("label", "y"))
rm(p_roi_suvr)

These values are supposed to be normalized to the inferior cerebellum gray matter, indicated by INFERIOR_CEREBGM_SUVR. To confirm, I’ll check the distribution of INFERIOR_CEREBGM_SUVR values.

p_inf_cb <- tau.df %>%
  select(-VISCODE, -VISCODE2, -update_stamp) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  filter(ROI=="INFERIOR_CEREBGM") %>%
  ggplot(data=., mapping=aes(x=SUVR)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("Number of Occurences") +
  xlab("Inferior Cerebellum Gray SUVR") +
  ggtitle("Distribution of Inferior Cerebellum Gray Matter Tau Uptake") +
  theme(plot.title=element_text(hjust=0.5))

ggplotly(p_inf_cb)
rm(p_inf_cb)

Most of the inferior cerebellum gray ROIs actually have SUVRs around 1.25. I’ll re-normalize all ROI values to the mean of this distribution in Data Preparation.

Subject demographics data exploration

ADNI compiled a merged dataset containing key information from several tables, including subject demographics, selected cognitive assessment scores, and select biomarker data.

subj.info <- read.csv("../../ADNI_Data/Raw_Data/ADNIMERGE.csv", stringsAsFactors = T, na.strings="")
subj.info.fields <- read_excel("../../ADNI_Data/Cleaned_Data/adnimerge_dict_mod.xlsx")
Without showing any of the data itself, here are the dataset features and corresponding descriptions detailed by ADNI:
datatable(subj.info.fields)

The columns I will be using for this project include:

subj.info$EXAMDATE <- as.Date(subj.info$EXAMDATE, format="%d/%m/%Y")
summary(subj.info %>% select(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB, DX))
##       RID          VISCODE        EXAMDATE               AGE          PTGENDER        CDRSB       
##  Min.   :   2   bl     :2272   Min.   :2005-02-12   Min.   :54.40   Female:6593   Min.   : 0.000  
##  1st Qu.: 685   m12    :1836   1st Qu.:2008-12-08   1st Qu.:69.10   Male  :8223   1st Qu.: 0.000  
##  Median :2074   m06    :1618   Median :2012-06-11   Median :73.50                 Median : 1.000  
##  Mean   :2613   m24    :1451   Mean   :2012-04-28   Mean   :73.48                 Mean   : 2.083  
##  3rd Qu.:4513   m18    :1292   3rd Qu.:2014-07-01   3rd Qu.:78.30                 3rd Qu.: 3.000  
##  Max.   :6874   m36    : 857   Max.   :2020-12-05   Max.   :91.40                 Max.   :18.000  
##                 (Other):5490   NA's   :8703         NA's   :4                     NA's   :4172    
##         DX      
##  CN      :3557  
##  Dementia:2313  
##  MCI     :4601  
##  NA's    :4345  
##                 
##                 
## 

There is a lot of missing data in this dataset – however, this dataset includes many subjects and visit dates that don’t correspond to tau-PET scans, and therefore won’t be used in this analysis. Missingness will be re-evaluated once the PET data and subject demographic data is merged in Data Preparation. One thing I noticed, though, is that the “maximum” (i.e. latest) date in the dataset is December 5th 2020, which clearly has not happened yet – this likely reflects incorrect data entry. Upon further examination:

subj.info %>%
  select(RID, EXAMDATE) %>%
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Subject Demographics Date Distribution',
         xaxis = list(title = 'Visit Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of Subjects')) 

There are 47 entries between July 2020 and December 2020, all of which likely reflect incorrectly-entered data. Fortunately, visits are also identified by a visit code describing the length of time from baseline, which can be used to merge these visits with the tau-PET Data. Another thing to note is that tau-PET was only incorporated into the ADNI pipeline in late 2015/early 2016, so any demographic information from pre-2016 will not be included in modeling and analysis.

Let’s check how many subjects had more than one visit recorded in this dataset:

p_subj_long <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  ggplot(., aes(x=fct_reorder(RID, n_exams, .desc=T), y=n_exams, label=RID)) +
  geom_bar(stat="identity", aes(fill=n_exams, color=n_exams)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of ADNI Visits per Subject") +
  ylab("Number of Visits") +
  xlab("Subjects") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

ggplotly(p_subj_long, tooltip = c("y"))
rm(p_subj_long)

Unlike with the PET data, most subjects have two or more visits recorded with cognitive and demographic information. The most ADAS-13 scores recorded for one subject maxes out at 21. To examine precisely how many subjects have at least two visits recorded:

num.subj <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_exams=sum(n_exams))
cat("Number of subjects with at least two ADNI visits: **", 
    num.subj$num_subjects, "**\n", 
    "\nTotal number of longitudinal ADNI visits recorded: **", 
    num.subj$total_exams, "**\n", sep="")

Number of subjects with at least two ADNI visits: 2027

Total number of longitudinal ADNI visits recorded: 14571

There are 2,027 subjects with two or more ADNI visits in this dataset. This should include all subjects and visit dates included in the tau-PET dataset, which will be confirmed upon merging the datasets in the Data Preparation stage.

It’s also worth checking the distribution of time interval between ADNI visits in these 2,027 subjects with two or more visits:

p.subj.interval <- subj.info %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  arrange(EXAMDATE) %>%
  mutate(Years_between_ADNI = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  filter(Years_between_ADNI>0) %>%
  ggplot(., aes(x=Years_between_ADNI)) +
  geom_histogram(stat="count", fill="lightsteelblue", color="lightslategray") +
  ggtitle("Years in between ADNI visits per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive ADNI visits") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 

ggplotly(p.subj.interval)
rm(p.subj.interval)

Interestingly, there are clear peaks around 0.5 (six months) and 1 (one year), and a smaller peak around 2 (two years). This indicates most subjects had a baseline assessment and two follow-up assessments at six-month intervals, with a subset going for a two-year follow-up as well. However, there is a positive skew to this distribution showing that a portion of subjects went up to eight years in between visits. This is not likely to affect my analysis, as most tau-PET subjects had PET scans from 2018 onward, and would therefore have a follow-up interval of two years or less.

Now, I’ll look into the distribution of the two target variables – DX and CDRSB – and how they relate to other covariates, namely age and sex. These visualizations are filtered to show only those subjects with 2+ assessments.

p.cdr.scores <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("# of Occurences") +
  xlab("CDR-Sum of Boxes") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution") +
  theme(plot.title=element_text(hjust=0.5))

ggplotly(p.cdr.scores)
rm(p.cdr.scores)
p.cog.diag <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=DX)) +
  geom_histogram(stat="count", aes(fill=DX)) +
  theme_minimal() +
  ylab("# of Occurences") +
  xlab("Diagnosis") +
  ggtitle("Cognitive Diagnosis Distribution") +
  theme(legend.position="none", plot.title=element_text(hjust=0.5))

ggplotly(p.cog.diag)
rm(p.cog.diag)

The majority of subjects are diagnosed with MCI (mild cognitive impairment). Many subjects do not have a current cognitive diagnosis; this will be revisited once the tau-PET and subject demographics datasets are merged.

Now, to stratify by age and sex, respectively:

p.cdr.sex.violin <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=PTGENDER, y=CDRSB)) +
  geom_violin(aes(fill=PTGENDER)) +
  theme_minimal() +
  ylab("CDR Sum of Boxes Score") +
  xlab("Biological Sex") +
  geom_signif(map_signif_level = F,
              test="t.test",
              comparisons=list(c("Female", "Male"))) +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

p.cdr.sex.hist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count.., fill=PTGENDER)) +
  facet_wrap(PTGENDER ~ ., ncol=1) +
  theme_minimal() +
  ylab("Number of Subjects") +
  xlab("CDR Sum of Boxes Score") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")


p.cdr.sex.violin <- ggplotly(p.cdr.sex.violin)
p.cdr.sex.hist <- ggplotly(p.cdr.sex.hist)
subplot(p.cdr.sex.violin, p.cdr.sex.hist, shareX=F, shareY=F,
        titleX=T, titleY=T) %>% 
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "CDR Sum of Boxes Score", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)))

rm(p.cdr.sex.hist, p.cdr.sex.violin)

The distribution of CDR Sum of Boxes score looks similar between Females and Males by eye, but I’ll follow up with a t-test to confirm:

t.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 

t.test(CDRSB ~ PTGENDER, data=t.test.df)
rm(t.test.df)
## 
##  Welch Two Sample t-test
## 
## data:  CDRSB by PTGENDER
## t = -2.804, df = 9542.1, p-value = 0.005057
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.27228130 -0.04822416
## sample estimates:
## mean in group Female   mean in group Male 
##             2.000650             2.160903

In fact, there is actually a statistically significant (p<0.05) difference in CDR Sum of Boxes scores between males and females, with male subjects exhibiting an average score ~0.15 points higher than female subjects. This is an important consideration, and I will be sure to include sex as a covariate in prediction models where applicable. Similarly, I’ll compare cognitive diagnosis by sex:

p.cog.sex.abs <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  plot_ly(x=~DX, color=~PTGENDER, type="histogram") %>%
  layout(xaxis=list(title="Cognitive Diagnosis"),
         yaxis=list(title="Number of Occurrences by Sex"))

p.cog.sex.rel <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  group_by(PTGENDER) %>%
  mutate(n_gen = n()) %>%
  group_by(PTGENDER, DX) %>%
  summarise(DX_prop = mean(n()/n_gen)) %>%
  plot_ly(x=~DX, y=~DX_prop, color=~PTGENDER, type="bar", showlegend=F) %>%
  layout(xaxis=list(title="Cognitive Diagnosis"),
         yaxis=list(title="Proportion of All Visits by Sex"))

subplot(p.cog.sex.abs, p.cog.sex.rel, shareX=T, shareY=F,
        titleX=F, titleY=T, nrows=2) %>% 
  layout(yaxis=list(title="Total Number of Occurrences", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Proportion of All Visits by Sex", 
                       titlefont = list(size = 12)),
         xaxis = list(tickfont=list(size=16)),
         title = "Cognitive Diagnosis Distribution by Sex",
         legend = list(orientation = "h"))

rm(p.cog.sex.abs, p.cog.sex.rel)

Similarly, I’ll compare CDR Sum of Boxes with age:

p.age.cdr <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE, y=CDRSB)) +
  labs(color="Number of Points") +
  xlab("Age at Visit") +
  ylab("CDR Sum of Boxes") +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  geom_point(size=1, alpha=0.2, color="lightslategray", fill="lightslategray") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

p.age.dist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE)) +
  xlab("Number of Occurrences") +
  ylab("Age at Visit") +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  theme(plot.title=element_text(hjust=0.5))
p.age.cdr <- ggplotly(p.age.cdr)
p.age.dist <- ggplotly(p.age.dist)
subplot(p.age.cdr, p.age.dist, shareX=F, shareY=F,
        titleX=T, titleY=T, margin = 0.05) %>% 
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "Age at Visit", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         autosize = F, width = 800, height = 500)

rm(p.age.cdr, p.age.dist)

While there doesn’t appear to be any clear linear association between age at visit and CDR Sum of Boxes score, I’ll use cor.test to calculate the Pearson correlation coefficient and the corresponding p-value based on the correlation coefficient t-statistic:

cor.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 

cor.test(cor.test.df$AGE, cor.test.df$CDRSB)
## 
##  Pearson's product-moment correlation
## 
## data:  cor.test.df$AGE and cor.test.df$CDRSB
## t = 6.8375, df = 10395, p-value = 8.512e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04775209 0.08602444
## sample estimates:
##        cor 
## 0.06691288

Interestingly, this correlation coefficient is statistically significant (p=8.512e-12), but the effect size is very small (r~=0.067). This significance seems to reflect the sheer size of the dataset rather than a strong relationship between age and CDR sum of boxes scores. Nevertheless, I will explore the use of age as a covariate in modeling later as this is a common practice.

Looking also at age as a function of cognitive diagnosis:

p.dx.age <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=DX, y=AGE)) +
  geom_boxplot(aes(fill=DX)) +
  ylab("Age at Visit") +
  xlab("Cognitive Diagnosis") +
  theme_minimal() +
  ggtitle("Age at Visit Distribution by Cognitive Diagnosis") +
  theme(plot.title=element_text(hjust=0.5))

ggplotly(p.dx.age)
rm(p.dx.age)

There is no glaring difference by eye in terms of age at visit by diagnostic group. This can also be confirmed with ANOVA (excluding NA diagnosis):

anova.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  filter(!is.na(DX))

age.dx.anova <- aov(AGE~DX, data=anova.df)
age.dx.anova

print("Tukey's HSD:\n")
age.dx.anova %>% tukey_hsd() %>% 
  mutate_at(c("estimate", "conf.low", "conf.high", "p.adj"), function(x) round(x,5))  %>%
  datatable()
## Call:
##    aov(formula = AGE ~ DX, data = anova.df)
## 
## Terms:
##                       DX Residuals
## Sum of Squares    2261.3  500977.0
## Deg. of Freedom        2     10241
## 
## Residual standard error: 6.994195
## Estimated effects may be unbalanced
## 2 observations deleted due to missingness
## [1] "Tukey's HSD:\n"

As it turns out, there are significant differences in age between the respective diagnostic groups. In addition to sex, I will include age as a covariate where applicable in modeling.

Outlier detection

To better identify outliers based on this multivariate dataset, I will calculate Cook’s distance for each individual PET scan once the tau-PET data is merged with the cognitive status data.